o 



X 



Generative and Latent Mean Map Kernels 

Nishant A. Mehta* and Alexander G. Gray 

School of Computer Science 

College of Computing 

Georgia Institute of Technology 

Atlanta, Georgia 



May 4, 2010 



Abstract 



^^ We introduce two kernels that extend the mean map, which embeds probability measures in Hilbert 
CO spaces. The generative mean map kernel (GMMK) is a smooth similarity measure between probabilistic 
models. The latent mean map kernel (LMMK) generalizes the non-iid formulation of Hilbert space 

Oembeddings of empirical distributions in order to incorporate latent variable models. When comparing 
certain classes of distributions, the GMMK exhibits beneficial regularization and generalization properties 
I— I not shown for previous generative kernels. We present experiments comparing support vector machine 

C/2 performance using the GMMK and LMMK between hidden Markov models to the performance of other 

O methods on discrete and continuous observation sequence data. The results suggest that, in many cases, 

the GMMK has generalization error competitive with or better than other methods. 
I Keywords: Kernel methods, graphical models, complexity 
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00 1 Introduction 

^D Generative kernels offer an elegant way to apply kernel methods for classification, clustering, and manifold 

If) learning to distributions, and they are particularly useful for applying kernel methods to non-iid data. By 

^^ using kernels that incorporate statistical dependence information, we can leverage a rich set of existing 

^^ methods such as kernel support vector machines (SVMs) and kernel principal components analysis (kPCA) 

. . [19j to learn from this data. As an example, we consider sequence classification as a particular instance of 

^ learning from non-iid data. In sequence classification, the goal is to label a sequence {Xi, X2, . . . , Xt) with 

one label 1", where the example sequences can be of varying lengths and the ordering of the observations 
within a sequence informs non-trivial dependencies. 
Cd SVMs using nonlinear kernels such as the polynomial and Gaussian kernels have performed strongly for 

a variety of non-sequential data classification tasks [T^ . More recent work has applied kernels to measure 
similarity between sequences via similarity between generative models trained on those sequences [11] or 
by making use of metrics on statistical manifolds O US] . Guilbart [7] and later Suquet [21] (also see [2S] 
for a related English-language paper) previously extended reproducing kernels to kernels on bounded signed 
measures. One particular kernel of Hein and Bousquet [5], Structural Kernel I, generalizes the kernel of 
Suquet. Unfortunately, efficient computation of this kernel family for many interesting distributions is 
highly non-trivial. 

We provide a new derivation for the natural special case of Structural Kernel I described by Guilbart 
and Suquet, provide concrete examples for how it can be computed for several distributions relevant to the 
machine learning community, and provide for the first time generalization error guarantees when learning with 
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this kernel. We also introduce a second kernel that naturally extends the idea of Hilbert space embeddings 
of empirical clique distributions (the empirical mean map) to graphical models with latent variables. 

For lack of a generally accepted name for this kernel, we refer to it in this work as the generative 
mean map kernel (GMMK). The GMMK measures similarity between observations by providing a nonlinear 
similarity between generative models estimated from each observation. We show in Section [6] that the 
GMMK has unique learning theoretic advantages. We also introduce the latent MMK (LMMK), which 
measures similarity between pairs of structured data observations with respect to a single (global) generative 
model 6. This is accomplished by measuring the similarity of sufficient empirical and posterior distributions 
from two structured data observations. 

We begin by reviewing Hilbert space embeddings of distributions via the mean map and then introduce 
the generative mean map kernel. In Section [3] we show how to compute the GMMK for several widely-used 
distributions, and in Section [4] we discuss the LMMK, an extension of the non-iid empirical MMK for latent 
variable models. We form connections between these two kernels and other kernels in Section |5l We then 
analyze some theoretical properties of the GMMK before concluding with promising results on sequence 
classification and learning a species manifold from biodiversity data. 

2 The Mean Map 

Although the concept dates back quite a bit [H [71 |2l] , the phrase mean map was coined by Smola et al. [20] 
as a Hilbert space embedding of empirical distributions. For X a domain of observations with probability 
measure P^ and X = {xi, . . . , Xm} C A" a set of m samples drawn iid, consider the reproducing kernel Hilbert 
space (RKHS) H with feature map (f> : X ^ H, where ^(a;) = k{x, •) and from the reproducing property 
we have {f,4>{x)) — f{x) and {(f){x) , (l){y)) = k(x,y). The mean map fi of the true (Px) and empirical (X) 
distributions respectively are defined as 

fi[Px]:^Ex[k{x,-)] (1) 

^i[x]■. = -J2H^^,■)■ (2) 

The operator [i maps distributions to elements of the RKHS and is injective for RKHSs induced by universal 
kernels [21], such as the Gaussian radial basis function (RBF) kernel k{x,y) = exp(— A||x — j/|p) and the 
Laplace kernel k{x, y) = exp(— ^^"^^ \xi — yi\) for A, /3 e M+. 

Some previous works [5J|3D] exploited the linear convergence of /i[X] to ^i[Px\ in order to compute kernels 
between two sets of samples; the methods introduced here are departures from these applications of the 
mean map. After briefly describing the empirical mean map kernel (EMMK) we connect it to the GMMK. 

2.1 Empirical mean map kernel 

The empirical mean map is an injective mapping (for universal base kernels) of empirical probability distri- 
butions into an RKHS. The method has been applied to iid observations and also non-iid observations [30] 
by fixing a dependency model for the observations and considering the empirical distributions induced by 
the maximal cliques of this model. The empirical mean map of non-iid data can be decomposed into the 
sum of the empirical mean maps for each maximal clique's distribution. 

Following the notation from 30J, for a graphical model G with variable set Z and maximal cliques set C, 
let Vc be universal kernels on the variable subset of Z induced by clique c G C. Then 

v{z,z')^^Vc{zc,z'^) (3) 

cec 

embeds all probability distributions with the specified conditional independence relations using an exponen- 
tial family model with kernel v ^j. 



A limitation of the empirical mean map is that it operates only on observed data and hence cannot handle 
latent variable dependency models. Critically, using dependency models over only observable variables can 
be quite restrictive, as highlighted by the strong performance of latent variable models such as hidden Markov 
models (HMMs) for many learning problems. Also, note that although Zhang et al. [30] used the empirical 
mean map with latent variable models, the latent variables were used to optimize a kernel-based objective 
rather than to model the dependency explicitly. Also, while the EMMK previously has been used to measure 
and optimize dependence [3 EHl H] ; it has not yet been used for classification. 

3 The Generative Mean Map Kernel 

3.1 The generative mean map 

Suppose we are given two objects x and y. These objects could be documents, sequences, images, or points 
in M". For all but the last case, special effort is necessary to form a suitable kernel between the objects that 
captures their underlying similarity well. We can modify the mean map in ([2]) such that the expectation 
is evaluated for x ^ P^ rather than x ^ P" (for P" the empirical distribution induced by the sample X), 
where Px is an estimated probabilistic model of x. This modification induces the generative mean map 

m[P.] = E_pJ0(x)]. (4) 

For Px learned from x and Py learned from y, let the generative mean map kernel be 

{^l[Px],^l[Py]) = £^^p^^y^p^[k{x,y)] (5) 

Px{x)Py{y)k{x,y)dxdy. (6) 



Though it has already been shown in various other works that this kernel is pd OH |211 E] , for completeness 
we provide a short proof. 

Proposition 1. The generative mean map kernel with positive definite (pd) kernel k (i.e. k >z 0) also is pd. 

Proof. For k pd, there exists a feature map (/> : A" — > "H. The GMMK is an inner product between mean 
elements in "H, for mean elements /i[p] = J p{x)(f){x)dx. By identification of the inner product between mean 
elements, given in Q, the GMMK is pd. D 

Note that the EMMK is a kernel between sets of points (which induce empirical distributions) whereas 
the GMMK is a kernel between functions. We explore this kernel for examples of generative models of 
increasing structure. 

3.2 Examples 

3.2.1 Discrete distribution 

For p and p' discrete distributions with mean parameters a = (ai,...,afe) and a' — (a'j^, ... ,a^) respectively, 
the GMMK is 

k k 

km{p,p) = ^^a,,a^exp(-A(l - 6ij)), (7) 

t=i j=i 

where 5i_j — 1 if i = j and otherwise. 

It would be of considerable benefit to compute this kernel for multinomial distributions; it currently is 
open whether this case admits a closed form expression. 



3.2.2 Gaussian distribution 

Lemma 1. Let p and p' he multivariate Gaussian probability measures Af{fj,,T,) and Af{fi' ,"£') respectively. 
For the Gaussian RBF kernel k{x,x') = exp(— |A||a; — a^'lP), the GMMK for p and p' is 

K^m[P,P)-JJ (27r)''/2|E|i/2 (27r)'^/2|E'|i/2 ic[x,x)axax 

exp(-i(/3^a-i;9-(5)) 
which can be computed in closed form as — ; z , ,, .„ — , (8) 



The proof follows from a multitude of linear algebra identities. 

3.2.3 Markov models and the connection to EMMK 

Suppose that each observation is a sequence of elements of M: 

Xi = {x\ % . . . ,x> "') for aU i e [n]. 

For both the generative mean map and the non-iid extension of the empirical mean map, let us assume the 
first-order Markov dependency model 

t=i 

For the empirical mean map, we make no assumption on the form of Pfu_i '■— P (xl {x^ ) , whereas 

for the generative mean map, we explicitly estimate Pf:u_i as PfU_i- 
The empirical mean map is 

* t=i 

whereas the generative mean map is 

T 

^ixu-,^T)~P(^^Wi.xi, ■ ■ ■ ,xt))] = / Y[P^^^_^cl){{xi, . . . ,xt)) dxi . . . dxT (11) 



for T a free parameter. The kernel for each map is simply the inner product in that map's feature space. 
We now show by example that the GMMK can be computed efficiently for various graphical models. 



^We have confirmed that the exponential term is symmetric in p and p' . See the simpler isotropic case in psl. 



3.2.4 Hidden Markov models 

For a hidden Marlcov model (HMM) with probability measure p, let q = (go, • ■ ■ , 9t) be the latent random 
variables and x = (xq, . . . , xt) be the observable random variables. We similarly define p' , q', and x' for a 
second HMM. 

Suppose that we have learned HMMs with probability measures p and p' and wish to compute the GMMK 
kni{p,p') for the observable variables of length T sequences drawn from these HMMs. The parameter T serves 
as a witness length which allows control over the length of the sequences to be embedded in the RKHS (a 
larger T translates to less weight on the models' initial conditions). The next result establishes the complexity 
of an efficient algorithm to compute the kernel. 

Lemma 2. The generative mean map kernel between an HMM of n states with probability measure p and 
an HMM of n' states with probability measure p' can be computed in time: 

1. 0{n^T + n'^k^) for a discrete observation HMM on k symbols with n' — n. 

2. 0{n^T + n^rn^ pd) for a continuous observation HMM with mixture of Gaussians state distributions, 
n' — n, and m' ~ m, for d the observation dimensionality, m and m' the number of Gaussians in the 
mixtures, and pd the cost of inverting a covariance matrix (pd = d for diagonal case). 

Proof. The quantity of interest is 

k^{p,p')= J2 Mq,xy(q',x')fc(x,x'). (12) 

x,x',q,q' 

We treat the discrete and continuous observation cases simultaneously in the following way. For the discrete 
case, we use the 1-of-fc encoding wherein, if the t*^ observation takes on value i out of k possible values, 
then Xt S {0, l}'^ and [xt]j = Sij, for all j G [k]. In the derivation for the discrete case belo'wrl we use the 
Gaussian RBF base kernel's isotropicity such that the kernel factorizes as fc(x, x') ~ Y[t=i k{xt,x[). Also, 
the linear chain structure of the HMM graphical model can be used to factorize the expectation 

km{p,p) = ^ p{xT I qr) ^ p'ix'rp I qT)k{xT,x'rr) 
qT,XT q!j,,3:'j, 

T-l 
n J2p^'1*+^ I 9*)p'(9t+i I It) Yl P^^* I 9*)p'(^f I <}t)Hxt,Xt)p{qo)p'{qQ) 



t=0 



qt.qi Xt,Xt 

T-1 



= E ^(it^q't) n Ep(*+i I <lt)T.P'(lt+i I <lt)^iluq'Mqo)p'iq'o)- (13) 

qT,q!r *=o It q't 

Now, '\li{qt,qt) = X^a; x' Pi^t I qt)p'{x[ I q[)k{xtTx'^) is itself a GMMK on the state distributions for qt and 
q't. These kernels need be computed only once for each pair of states {qt,q't), yielding cost n^ times the 
complexity to evaluate this kernel once, which is 0{k^) for the discrete distribution and 0{m?pd) for a 



mixture of Gaussians distribution. From the factorized structure of the rest of the computation in ( 13 ) we 
see that it is 0{Tn^) (see the algorithm in Figure [I]), as all 0{T) latent variable marginalizations are done 
over functions of at most 3 variables. D 

Generally, sequences for which one would like a similarity measure are of different lengths, precluding 
computation of a kernel without resorting to truncation or other compromises. Even for sequences of the same 
length, a priori there is no reason why the sample indices of the sequences should be considered aligned; 
application of standard kernels invariably relies upon distance computations made between mismatched 
random variables. The GMMK addresses this issue by first learning a generative model for each sequence 

•^For continuous observations, sums become integrals. 



for i == 1 to n do 




for j = 1 to n' do 




ipji = kn,{p{x\q = i),p{x'\q' 
= rnv'T 


= J)) 


cf) = cf) » xp 




for i = 1 to T do 




= (A''^(/)A) • V 




return Er=i E"=i '/'r^ 





Figure 1: We show how to compute the GMMK for HMMs. • is the Hadamard product, [A]ij — P{qt+i ~ j \ Qt ~ i), 
and [tt],; = P{qo = i). 

and then performing kernel computations on the expected sequences that result from each generative model. 
While sequences drawn from similar distributions can appear to be very different due to the stochastic nature 
of their generation, by using a measure between the distributions of sequences themselves, we bypass this 
problem and achieve a more robust similarity measure. 

3.2.5 Linear dynamic systems 

The GMMK also can be computed for linear dynamic systems of the form 

qt+i = Aqt +Wt Wt^ M{0, If] (14) 

xt^Cqt+vt vt^N{Q,R), (15) 

where A : M*'' ^ M.^ and C : M*^ — >■ M" are linear operators and i? is a covariance matrix. 

The computation follows almost directly from the formulation of Jebara and Kondor ,llj for the prob- 
ability product kernel (described in Section pi). Briefly, for p and p' being probability measures over linear 
dynamic systems, k^{p,p') can be shown to be the GMMK between two Gaussians. In particular, the two 
Gaussians are Af{p,x, ^xx) and N{p.x', '^x'x'), where ^x — {^J'xo, ■ ■ ■ P'xt)j ^xx is a block diagonal matrix with 
blocks Sjjj^j^j, and we have the following recursive updates: 

fJ-qo '■= fJ- ^^qt '■= '^^^qt ^'3t + l:9t+l •= ^^«t,«t^ + -^ 

A*Xt '■=Cl^J'qt ^Xt.Xt '■— CYiq^^q^C + R. 

3.2.6 Kernel density estimators 

The GMMK can be used as a kernel on nonparametric density estimators. The idea of kernels between 
density models of sets has been explored previously by Jebara and Kondor [T^] with the Bhattacharyya 
kernel . Whereas they implicitly map the data to an RKHS using the Gaussian kernel and then learn single 
Gaussian models in the feature space, here we use kernel density estimators (KDEs) in the original space. 
An advantage of using kernel density estimation is that it is known to be consistent [27| . 
Let f^{z) beaKDE 

fz{x)^ — Vfc^^(z,,x), (16) 

2 — 1 

where hx is the bandwidth. The GMMK between two Gaussian RBF KDEs fx on observations X = 
(xi, . . . , Xm^) and fy on observations F = (j/i, . . . , ymy) is then 

(m[/], /"[/']> = E^^f [kix.x')] = y]y] / kh^ixi^x) / khy{yj,x')k{x,x')dxdx'. (17) 

x'~f' " i=l J = l 



'Note that we can use identity covariance (I) without loss of generality (as per footnote 4 of |17|). 



For k the Gaussian RBF kernel, this expression only requires evaluations of the GMMK on isotropic 
Gaussians. The form for general Gaussians is in (Is]). For two A^-dimcnsional isotropic Gaussians M{^,hl) 
and A/'(/i', /i'/), the GMMK admits the more pleasant form 



[i + \[h + h'))^r- 



■ exp 



'l|2 



21 + \{h + h') 



N/2 



exp 



"0 



l A||/i-MlP 
2 ho 



(18) 



where /iq := 1 + \{h + h'). Substituting ([T8| for the integrals in ^j yields KDE GMMK closed form 

™x ™H / 1 Ml 112 \ 



1 



.N/2 



rnxmyfiQ ^^^ j^^ 



'■0 



4 The Latent Mean Map Kernel 

4.1 Empirical mean map limitations 

By relaxing the dependency models of the empirical mean map kernel to include latent variables, we generalize 
the kernel to include richer dependency models such as dynamic Bayesian networks and hidden Markov 
random fields. As with the non-iid empirical mean map, we need only apply the mean map to the distribution 
of each maximal clique. The maximal cliques now fall into two sets: fully observable cliques and cliques 
containing at least one latent random variable. The distributions for the former cliques can be computed 
empirically similar to |30j : however, applying the empirical mean map to the distributions of the latter cliques 
is impossible due to the latent variables. 

4.2 The Latent mean map 

The latent mean map augments the empirical mean map by using the posterior distribution of the latent 
variables (with respect to a model specified by 0), conditional on the observed variables. For conciseness, all 
expectations in this section implicitly are made with respect to a single model 9; this 9 should be estimated 
from the examples, or a subset of the examples, that are being embedded into a Hilbert space. 

Let (u, v) be the concatenation of the components of vectors u and v to form a higher dimensional vector. 
For observed variables X, latent variables Y, and clique-restricted subsets X^ and Y^ , the latent mean map 

of (x„ y,) is 



1 



Hc[{X,,Y,)] = E(^^,^^)|,^.„. [4>,{{X,,Y,))] - — VE 

V L- 1 L, / I J. . I J I y J 



Yc^'^ki 



^c({x^,Y^'^) 



(20) 
(21) 



for Yc ^ PiYc I xi:m)^ the posterior distribution of the random variable Yc conditioned on all observa- 
tions xi;m- This expression captures our best estimate of the clique distribution. 



From ( 20 ) , the latent mean map kernel 

Y,{l^,[{X.,Y,)],Hc[{X',X)]) 



(22) 



cec 



expands to 



^ mem' ^ ^ ^c'JI^i- 



.,((xW,y«),(x^o),y;W) 



(23) 



Our end goal is to compute the kernel on many object pairs for SVM classification or kPCA, but even 
moderate rric and m^ render the above expectation intractable. An often exploited trick of kernel methods is 



the ability to compute inner products in a potentially infinite dimensional space without the need for explicit 
representations in that space. Here, however, an approximate explicit representation yields computational 



tractability by allowing us to work with the efhcicnt form in (20 1. For example, the Gaussian RBF kernel 
on univariate continuous data admits a truncated Taylor expansion of the exponential [531 Theorem 4.6], 
empirically yielding low error for low order truncations I29J . For multivariate data, two recent explicit 
representations approximate the RKHS using random features, with error decreasing exponentially in the 
number of features chosen [1^1 . 

It may be useful to use nonlinear representations even in the space of distributional Hilbert space em- 
beddings. Given a latent mean map kernel matrix isT, this can be accomplished by using the alternate kernel 
matrix K such that 

K,j = eyiv{~v{K,, ~ 2K,j + K^j)), (24) 

for some parameter v. In our LMMK experiments, we push A toward infinity and consider different values 
of v rather than A. 

4.3 Latent mean map of HMMs 

Learning using the latent MMK requires a dependency model to induce latent variables and a set of con- 
ditional distributions sufficient for the model. This model identifies a set of maximal cliques and allows us 
to compute the posteriors. Suppose we have an HMM 6 as described earlier. Assuming stationarity, the 
model's maximal cliques (xt,qt) and {qt,qt+i) yield T instances of the former and T — 1 instances of the 
latter clique: 

T ^ T N 



i=l t = l i=l 

1 ^^^ 
f^qq[{qt,qt+l)] = Y^ 51 ^Qt:Qt+l\x.e [(t>qq{{Qt,Qt+l))] 
i=l 

T-1 N N 

3T E E E p(Q* = *' Q*+i = ■?■ I ^' ^)<^9?((Q*' Q*+i))- 



T- . 

t=i i=i j=i 

The forward-backward algorithm can compute the conditional probabilities jl5]. We adopt Rabiner's nota- 
tion p!^ for the conditional probabilities so that we have 

7tW = P(Qt = i|x,0), (25) 

Ct{i,j) = P{Qt = t,Qt+i=j\x,9). (26) 

We use the joint kernel on cliques Vc{{xc,yc)i {^'ciUc)) = kc{xc, x'^)lc{yci Vc) such that 

Vxq{{xt,qt),{x't,qt)) = kx{xt,x[)lq{qt,qt), (27) 

and likewise 

Vqq{{qt,qt+i), {q't, q't+J) = Iqiqt, q't)lq{qt+i,q't+i)- (28) 

As before, we use the 1-of-fc encoding to treat the discrete and continuous cases identically. The kernel k will 
be the Gaussian RBF kernel. For an A^-state latent space, K possible symbols, and 7(c) (j) '■— Y^t-x =c^tiJ)' 
the kernel v^q is 

^ E7(a)(*)f7[a)(*) + e-^ E 7(.)(j) + e-^ E Ub)i^) + ^'^ E 7[.)(j))), (29) 

ae[K] \ J&IN]V be[K]\a ^ j<i[N]\i ^ / 

ie[iv] 



which has O(N^K^T) complexity. The continuous observation case of v^q is 

i=l \ \s=l t=l I j=l \s=l t=l ' / 

= ^((E^(^-^*)E7^W7*«)+^"'(E^(^-^*)E^^(*ho')))- (31) 

Whereas ( [3l| ) is 0{N^T^), this can be reduced by exphcitly representing RKHS elements in (30|. Defining 
£,{i,j) '■— ^^ X]teT^*(*'^)' ^^'^ kernel on the random variable clique {qt,qt+i) Vqq is 



ahj)U{i,3) + e-^ J2 e{hj') + e-' Y. Ui^'^j)+^'' E ^'(*''/)) 
1 V i'e\N]\j i'e\m\i^ j'eimvj ' . 



E^(*'^')|^'(*'J') + '^"' E e'(*,/) + e-^ 1^ k'(^',j)+e-^ Y^ a^J)\V (32) 

je[JV] V j'e[JV]\j j'e[7V]\j^ i'e[iv]\j 



Interestingly, as A approaches infinity, the LMMK on HMMs takes a form similar to the plain Fisher kernel 
on HMMs [H] (i.e., when the Fisher information matrix is replaced by the identity matrix). From our results 
it will appear that the differences in the computation of two kernels significantly affect their performance. 

5 Related Work 

5.1 Probability product kernel 

For probability measures p and p' and x ^ X ^ the probability product kernel [TT] is 

fcp(p,p') = / vixYl^^xYdx. (33) 

Jx 

The probability product kernel (PPK) is a special case of the GMMK. 
For the case where p = 1, the following Proposition easily follows. 

Proposition 2. The probability product kernel with p — 1 is a special case of the generative mean map 
kernel with convergence exponential in X as A — >■ oo. 

Proof. We use the convergence of the scalecH Gaussian kernel kcix, x) — vAexp(— A||a; — x'jp) to the identity 
kernel ks(x, x') — 5{x — x') as A — > oo. The PPK lim kai{p,p') expands to 

A— foo 

lim / / pix)p'ix')kaix,x')dxdx'= lim / / pW(.')v/Aexp(-A||. - .T) ^-^-'. 



A — >00 J -y J X' A— >-CX3 J -y J V 



Now, using (x, x') i— )■ (a, b) \— (x, x — x') and making the substitution a := -r- yields 

V A 

lim / / p(a)p'(a-6)-exp(-|16|p/cr2)dad&= p{a)p'{a)da = kJp,p'). D 

Further, it is possible to express a GMMK analog to the PPK for the case of p 7^ 1; however, the 
expectation operator is fundamental to the GMMK's derivation and this operator requires p — 1- Provided 
that the GMMK can be computed for the distribution of the observed random variables, graphical models 
for which the PPK is computable are also computable for the GMMK; this can be seen by observing that 
the Gaussian kernel couples each pair of observed variables Xi and x'^ into a 2-clique. In the clique graph 
used by the junction tree algorithm, this 2-clique appears wherever Xi appears in the PPK's clique graph. 



This scaling is only for the technical reason of the limit in the last line of the proof. 



Another connection between the GMMK km{p,p') and the PPK kp is that km is an expectation of 
kp between one fixed distribution and an isotropic Gaussian centered at the points drawn from the other 
distribution. For a = (2A)~^/^, we easily see that 

E,^p,[kp{p,M{x,a^))] = E,^p[kp{p',Mix,a^))]. (34) 

5.2 Other related kernels 

As mentioned earher, the concept of embedding probabihty distributions into Hilbert spaces via the expec- 
tation of a feature map dates back to Guilbart [7] . The concept was explored further by Suquet [211 l2Sjj 
Hein and Bousquet [8] later generalized this kernel by replacing the scalar product of the measures by a 
positive definite function between the measures; however, their work did not discuss the learning framework 
for the case of structured data such as images and time series, and so far the complexity of the associated 
RKHS has not been quantified in terms amenable for generalization error guarantees for empirical risk min- 
imization. The expectation of a feature map is similar to the marginalized kernel |26j , although that work 
does not discuss RKHSs and focuses on count kernels rather than higher order (Taylor type) kernels such 
as the Gaussian RBF kernel. The derivation of the GMMK also is different, coming from the direction of 
Hilbert space embeddings of distributions to arrive at a kernel with good regularization properties. 

The EMMK relies upon observing the labels in order to include them in a mean map of the full joint 
distribution over observed variables and labels (latent variables). Without learning a generative model with 
latent state variables, the EMMK is limited to graphical model dependencies between only the observed 
variables (see the discussion in Section 111). 

To our knowledge, the Bhattacharyya affinity is the earliest form of a similarity measure between prob- 
ability distributions [5], whereas the Fisher kernel [S] is the earliest one used as a machine learning method. 
The kernel is based on the score Vglogp{X \ 9) for 9 the maximum likelihood estimate of a model. The 
Fisher kernel is sensitive to the parameterization of the statistical family used and is not easily computable 
without using a surrogate for the Fisher information matrix. The heat kernel [T3] is not sensitive to the 
parameterization, but it is rarely computable in closed form; however, it can be approximated under certain 
conditions to yield a positive definite kernel, e.g. by using leading terms in the parametrix expansion for 
small time parameter t. Although it has not yet been shown, the heat kernel may be (approximately via the 
parametrix expansion) computable for HMMs with multinomial observation distributions. 

Rational kernels [3] are another class of kernels that can handle variable-length sequences. Unlike rational 
kernels, the GMMK is not restricted to observations constituted by a finite alphabet (as is clear from the 
GMMK on probability distributions over M"). 

A possible weakness of kernels which do not take make use of the structure of a probability space 
is that they treat all distributions with disjoint support identically. It is not difficult to show that the 
Bhattacharyya affinity and general probability product kernels produce a similarity of for distributions 
with disjoint support. For the case of kernels based off of the Kullback-Leibler-divergence, a similar result 
trivially holds: 

ii:L(e'i||6'2) = £'eilog--— -^ = / pe,{x)\og{pe,{x))dx- / pe,{x)\og{pe^{x))dx 

= -H{pgJ + log + ^ oo. 

Incorporating smoothness into a similarity measure can alleviate this problem. In fact, we can show that 
the smoothing property can provide theoretical guarantees with respect to complexity of the RKHS. 

6 Statistical Learning Bounds 

To better understand the GMMK km from a statistical learning theory perspective, we first explore the 
complexity of the hypothesis space induced by this kernel. For simplicity of the analysis, we restrict our 



^Guilbart and Suquet actually consider signed measures. 

10 



analysis to kernels on symmetric, univariate location-family probability distributions with a bounded location 
parameter. This restriction ensures that the kernel is translation invariant with respect to the location family 
parameter 9 € Q G M., where O is compact. Recall that location families are parameterized by a location 
parameter 6, so we have that the density function fg{x) = fo{x — 6). With this definition, we can view the 
kernel as a function fc : x — >■ M, where 8 C M is the space of the location family parameter. 

6.1 Covering number bounds on the RKHS 

Prior to bounding the complexity of the hypothesis space induced by the GMMK, we need to introduce some 
notation. Let "H be the RKHS corresponding to the generative mean map kernel k restricted to the above 
location family domain. Denote by B-u{R) the radius R ball of %: 

BH{R):={.f^n:\\f\\H<R}. (35) 

and let Bx '■= Bx{^) be the unit ball in some metric space X. We define Ik to be the inclusion 

id : -H ^ C{X). (36) 

Recall that the e-covering number J\f{M, e) of a subset M of a metric space (X, d) is defined as the minimal 
n e N such that there exist n balls in X of radius e that cover Af ; more formally we have 

n 

U{X,d,e) := inf{n G N : 3{a;i, . . . ,a;„} C X satisfying M C |J Bd{x^,e)}, (37) 

4=1 

where (with mild notation abuse) Bd{xQ, e) := {x ^ X : d(x, xq) < e}. To simplify notation, we may express 
e-covering numbers as Af{E, e) for a Banach space E, where the metric is taken to be corresponding norm 
for E. Finally, for an operator T : E —>■ F between two Banach spaces E and F, the covering numbers of 
the T are defined as 

AA(r,e):=AA(TSB,e). (38) 

For convolutional kernels, Zhou [3T] showed that when F[k] (i.e. the Fourier transform of the convolutional 
form of k shown below) decays exponentially, the covering number of this operator when acting on a radius- i? 
ball in H satisfies 



\nAf{lK{Bn{R)),\\ ■ Woo, V) < Ck.n (\n- \ (39) 

for a constant Ck,n depending only on the kernel and the dimension n of the domain. In our case, n = 1. 

We first show that the Fourier transform indeed indeed does decay exponentially for fc, and then we 
briefly summarize a result from Steinwart and Christmann i22 to arrive at a generalization error bound for 
SVM learning. 

Let a = (2A)^^/^ the variance of the Gaussian RBF function used as the base kernel in k. We have the 
following 

Theorem 1. The Fourier transform of the the convolutional form of k decays exponentially as 

F[k]{uj) = P{ujf cxp(-wVV2). (40) 

Proof. We first show that the kernel is convolutional: 

k{pe,Pe')= / j pe{x)pe'{x')-exp{--{x~ x'Y /a'^)dxdx' 

Pe{x)pe{x' ~ A6I)- exp(-i(a; - x'fja'^) dxdx' = fc(A6l), 
a 2 
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where A^ = 9' — 9. Prior to taking the Fourier transform of A:, we make a few simphfications: 

k{A9)= I f peix)- exp{-hx' - x)'^ /a^) dxpsix' - A9) dx' 

= j f(x')g{M~x')dx', 

where f{x') := j pg{x)l ey.Y>{-\{x' - xf / a"^) dx &uA g{M ~ x') := pg{A9 ~ x'). 

In taking the Fourier transform F[k]{uj) := FT[k{6' — 9')], we apply the convolution theorem to yield 
F[k]{iu) = F{x')G{x'). Now, observe that 



F{x') : = FT 



pg{x)— exp(— -(x' — xY /a'^)dx 
a 2 



FT bo (a;)] FT 



-exp(--a;Vo-^) 



(Convolution Theorem) 



FT[po(2:)] exp(-wVV2). (FT of Gaussian) 



Define G{x') := FT[g(A6' - x')] = FT[5(a;')]. Then (40l follows by defining P{uj) := FT[po{x)]. Hence, 



the Fourier transform of the kernel operator decays exponentially, independent of the decay of P{uj). D 

In many cases, we cannot assume that P{uj) decays rapidly; for the uniform distribution the decay 
horrifically is not a decay at all (the Fourier transform is periodic). Even for the Laplace distribution, the 
decay is only quadratic in w; specifically, the decay is 0{2a/{a'^ + w^)). Hence, the Gaussian convolution 
is critical to the exponential decay; in contrast, the Fourier spectrum of the PPK depends on the Fourier 
transforms of the distributions [10 . Although yet to be shown, we conjecture that the smoothing via the 
Gaussian convolution in fc,„ affords strong regularization properties for more general classes of distributions. 

6.2 SVM Learning bound 

We now show how the above result can be plugged into an oracle inequality for SVM learning with our kernel 



k, by making use of the covering number bound in (39 1. Define Y := {—1, !}• Let TIl,q be the risk defined 



as E(3._j,)^q[L(x, y, /(x)], for a probability measure Q on Q x Y and a loss function L : Q xY xM.. Recall 
that the SVM problem is to minimize the regularized risk functional 



and define 



inf A||/||?,+7^i,z,(/), (41) 



n.p,«:= inf 7^i^p(/)■ (42) 



We now can use an oracle inequality such as Theorem 6.25 of Steinwart and Christmann [22] . restated 
in slightly different and simplified terms here for convenience: 

Theorem 2. Denote by fD,x unique minimizer of the SVM problem with regularization term A over empirical 
distribution D. Let Q C M. be a compact metric space and L : Q x Y x ^ —>■ [0,00] be the hinge los^ 
L{x, y, f{x)) — max{0, 1 — yf{x)}. Further, let T-L denote the RKHS of k and P be a probability measure on 
Q xY , with D ^ P" (iid). Then, for fixed A > 0, n > 1, e > 0, and t > 0, we have with probability greater 
than 1 — e^'^ that 



where the approximation error functional ^2(A) :— inf/g-j^ ''^ll/ll^ + T^L,p{f) — T^l p-h (approaches zero as 
A^O. 



^Notc that the hinge loss is Lipschitz continuous with Lipschitz constant 1 and satisfies L{9,y,0) < 1 for all (9,y) S X Y. 
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Finally, we note that for the kernel of interest, under the restriction specified at the beginning of this 
section, we have the covering number bound 

lnAA(B«,||.||oo,Ai/2e) = lnAA(/K,Ai/2e)<Cfe('ln^') , (43) 

for a constant c^ depending only on the kernel. 

7 Experimental Results 

7.1 Sequence data 

We evaluated the GMMK and LMMK through classification of two discrete observation sequence datasets 
(synthetic and human DNA sequence data) and five continuous observation sequence datasets from the UCR 
time series dataset repositorjr] We explored the performance of SVMs using the GMMK and the PPK (fcm for 
A -^ oo) on discrete observation HMMs, as well as the latent MMK using a model learned from one class. All 
HMMs were learned via uniform distribution initialization for initial state and state transition probabilities, 
random initialization of emission probabilities for discrete observations and constrained fc-means clustering 
initialization for continuous observations, a segmental clustering update via the Viterbi algorithm [T5], and 
execution of the Baum- Welch algorithm until the first of log-likelihood convergence within 10~^ or 1000 
iterations. 

For the GMMK we used the expectation of the Gaussian RBF kernel, where in the discrete case each 
symbol is mapped to a vector using the 1-in-A: encoding, and we explored logarithmically spaced settings of 
the parameter A. For the GMMK we experimented with linearly spaced settings of the witness length T. 
For all kernels, the regularization parameter C was tested at logarithmically spaced values. We compare 
these results with a maximum-likelihood Bayes classifier which maintains an HMM model for each class, the 
Gaussian kernel using the metric induced by the Fisher kernel with a model for one class, and the EMMK 
using Markov models of order^l, 2, 3, and 4. We report all results for stratified 10-fold cross-validation. 

The synthetic dataset consists of 2 classes. For each class, we manually constructed a 3-state 2-symbol 
HMM to serve as a sequence generator. 500 binary sequences of length 100 were generated for each class. 
All HMMs learned had 3 states. Table [l] shows the Bayes classifier and the GMMK perform almost equally, 
whereas the Fisher kernel and empirical/latent MMKs perform slightly worse. The PPK exhibits much 
higher loss than the other methods, possibly owing to its higher sensitivity to errors in model estimation. 

We also ran two-class sequence classification experiments on a random subset of 500 exons and 500 introns 
from the HS"^D dataset [H]. The sequence lengths vary from order of 10^ to 10^ symbols in length, making for 
a challenging classification problem. For the GMMK and PPK, we adopted a heuristic formula ilTj to choose 
the number of states n in the HMM learned for a particular sequence: n = [^^/k'^ + 4(T7 + fc + 1) — ^k\ +1, 
for k symbols and a constant 7 set to 0.1. For the Bayes classifier, we used a 4-state model for the exons and 
an 8-state model for the introns because this configuration produced the lowest loss. For the latent MMK 
and the Fisher kernel, we used a 4-state model trained on the exons. For the EMMK, we report results for 
a Markov dependency model of order 3, which produced the best results among orders {1,2,3,4}. 

We restrict our results for the GMMK to the setting of T = 30 which had the lowest mean loss over 
all values of A. Our results in Table [l] show that the GMMK performs slightly better than the PPK, both 
of which outperform the Fisher kernel. The GMMK significantly outperforms the Bayes classifier and the 
EMMK, and the LMMK performs relatively poorly on this problem. 

Across all of the time series datasets, the GMMK is competitive with or outperforms the other methods. 
On Gun-Point and Wafer, the GMMK well outperforms the rest of the methods. On the other three time 



^We selected all but one of the 2-class problems from the UCR set. The excluded dataset, Lightning-2, would have required 
significant preprocessing. No preprocessing aside from scaling was performed. 

*An order-fc Markov model induces maximal cliques of size A: + 1, with the kernel described in I30| . An EMMK with this 
model effectively is a string kernel whose feature space representation consists of counts of each string of length fc -|- 1. 

^EMMK results on synthetic data were unstable with respect to A. 
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Table 1: Error for methods on sequence data. * indicates that a class-balanced random subsample was used, 
indicates that experiments were not performed due to computational intractability. 



■'X" 



# observations 


GMMK 


PPK (p = 1) 


PPK (p = 0.5) 


EMMK 


LMMK 


Fisher 


Bayes 


(1000) Synthetic 


0.063 


0.091 


0.247 


0.067 


y 




0.068 


0.066 


0.062 


(1000) HS' D 


0.144 


0.149 


0.165 


0.215 


0.279 


0.170 


0.192 


(56) Coffee 


0.232 


0.214 


0.268 


0.286 


X 


0.268 


0.339 


(200) EGG 


0.250 


0.250 


0.240 


0.320 


X 


0.335 


0.335 


(200) Gun-Point 


0.165 


0.230 


0.265 


0.185 


X 


0.385 


0.235 


(1624* ) Wafer 


0.077 


0.231 


0.144 


0.173 


X 


0.104 


0.245 


(1000*) Yoga 


0.304 


0.304 


0.341 


0.340 


X 


0.305 


0.422 



0.26 
0.2 

0.15 
0.1 


-0.05 

-0.1 
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Figure 2: I''' vs 4**" kernel principal components (kPGs) of the GMMK matrix on INBio data for A = 1. Plants, 
molluscs, arachnids, insects, and fungi are triangles, circles, asterisks, x's, and crosses respectively. The plot is similar 
for the 1=' vs 2""^ kPGs (not shown). 



series datasets it is at least competitive. Interestingly, the EMMK often performs well, 
plan to explore the performance of its generalization, the latent mean map kernel. 



In the future, we 



7.2 Biodiversity data 

We also applied the GMMK to visualize ecological relationships between species sampled in Costa Rica 
by INBicp^ For each species, observations consist of counts on a grid, where each grid location has the 
accompanying abiotic features altitude, median annual temperature, annual precipitation, isothermality, 
and temperature seasonality. Because estimating spatial density is very suspectible to the fact that this 
is presence data, we instead estimate density in the abiotic feature space. Each species has a true density 
in this 5-dimensional space, which we estimate using Gaussian RBF KDEs. For each species sampling, we 
optimized the bandwidth of a KDE by minimizing the integrated square error |28j . After obtaining the 
optimal bandwidth for each species sampling, we formed the kernel matrix between all pairs of the species 
KDEs using the GMMK with A = 1. We then applied kPCA to identify whether particular components 
well-separate certain groups of species. The visualization of the 1^* and 4*^ kernel principal components in 
Figure [TT] shows some separation between the different groups: the insects and arachnids are clustered near 
the center, with plants toward the left and molluscs toward the right. 



■^The National Biodiversity Institute of Costa Rica, http://www.inbio.ac.cr/en 
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8 Conclusion 

We have discussed results on two generative kerenls, each embodying a different family of smoothed similarity 
measures between probability distributions. We have shown the first learning theoretic guarantees for the 
generative mean map kernel under certain conditions, and we have introduced the latent mean map kernel, a 
natural extension to Hilbert space embeddings of empirical distributions. Unlike other generative kernels, the 
GMMK both can operate on latent variable models and enjoy certain generalization error bounds independent 
of the Fourier spectra of the distributions it is comparing. We hope to generalize this result to a larger 
class of distributions in the future. Experimental results on discrete and continuous data suggest that the 
GMMK can outperform and is at least competitive with the PPK, the EMMK, the Fisher kernel, and 
the Bayes classifier. Using sampling techniques |n], it is straightforward to extend the GMMK to more 
expressive models such as factorial HMMs and Markov random fields. For models with continuous random 
variables, such as HMMs with mixture of Gaussians observation distributions, evaluating the LMMK is 
highly expensive computationally; further testing in the continuous domain demands that we first overcome 
these computational challenges by using approximations of explicit representations of RKHS elements. 
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